generalization error
Kernel Renormalization in Bayesian Deep Neural Networks: the Equivalent Wishart Ansatz in the Proportional Regime
Baglioni, Paolo, Keup, Christian, Zimbardo, Vincenzo, Pacelli, Rosalba, Vezzani, Alessandro, Burioni, Raffaella, Rotondo, Pietro
The scaling limit where both the size of the training set $P$ and the width $N$ of a deep neural network grow at the same rate, the so-called proportional-width regime, has been intensely studied for shallow, single-hidden-layer networks. However, extending these non-perturbative results from shallow architectures to deep non-linear networks has proven very challenging. Here we present an effective approximate approach to predict the generalization performance of Bayesian multi-layer perceptrons (MLPs) of fixed depth $L$ on arbitrary high-dimensional data. We propose an equivalent Wishart Ansatz to capture the dominant stochastic fluctuations of the hierarchical empirical kernels of MLPs. This allows us to perform a large deviation analysis for the partition function of MLPs in the proportional limit, expressed in terms of a renormalized NNGP kernel. In this description, even strong representation learning in the proportional limit is encoded in at most $L$ scalar order parameters, determined self-consistently. Extending the approach to convolutional architectures (CNNs), we identify a hierarchical local kernel renormalization mechanism, which allows us to quantify more complex data-dependent transformations of the large-width kernel in CNNs due to finite-width effects. We test our effective theory against sampling experiments from the Bayesian posterior of finite deep neural networks with depths $L \sim O(10)$ and $P\sim O(10^3)$ on classic benchmark datasets, finding overall very good agreement together with two distinct types of systematic deviations.
Signal-to-Noise Ratio and Sample Size Govern Representational Alignment in Neural Networks
Umar, Ali Hussaini, Laio, Alessandro
Neural networks are known to develop latent representations that are $aligned$, namely structurally similar across networks trained with different architectures, training protocols, or training datasets. We study this phenomenon in a controlled setting, where we train an ensemble of networks on regression and classification tasks using training sets perturbed by independent realizations of a noise process. We show that the signal-to-noise ratio (SNR) and the training sample size influence the alignment in qualitatively similar ways in networks trained on real-world datasets and in an extremely simple $linear$ network with a single hidden layer, for which the alignment can be estimated analytically. Across linear and nonlinear networks, regression and classification tasks, and both synthetic and real-world data, we consistently observe that alignment varies monotonically with SNR but non-monotonically with training sample size. In particular, the alignment is minimized near the interpolation threshold, and a stronger alignment does not necessarily correspond to better generalization error. These findings reveal a non-trivial dependence of alignment on data quality and quantity, decoupled from generalization performance.
Representation Gap: Explaining the Unreasonable Effectiveness of Neural Networks from a Geometric Perspective
Perera, David, Moura, Victor, Santos, Lais Isabelle Alves dos, Haddad, Michel F. C., Figueiredo, Flavio
Characterizing precisely the asymptotic generalization error of neural networks using parameters that can be estimated efficiently is a crucial problem in machine learning, which relies heavily on heuristics and practitioners' intuition to make key design choices. In order to mitigate this issue, we introduce the Representation Gap, a metric closely related to the generalization error, but admitting better-behaved asymptotic dynamics. Focusing on equivariant diffusion models and leveraging results from optimal quantization and point-process theory, we derive a precise asymptotic equivalent of the Representation Gap and show that it is governed by a single parameter, the \textit{intrinsic dimension} of the task, which is easy to interpret, efficient to estimate, and can be linked to the equivariances of common neural network architectures. We show that this asymptotic dynamic also extends to a broader range of tasks and training algorithms. Finally, we demonstrate empirically that our asymptotic law and intrinsic dimension estimation are accurate on a wide range of synthetic datasets, where these quantities are known, as well as on more realistic datasets, where we obtain results consistent with the related literature.
MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
Huang, Feihu, Luo, Yuning, Chen, Songcan
Matrix-structured parameters appear frequently in artificial intelligence models such as large language models. Recently, the efficient Muon optimizer was designed for the matrix parameters of large-scale models, and it shows markedly faster convergence than vector-wise algorithms. Although some works have begun to study the convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) are still not established. Thus, in this paper, we study the generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that Muon has a generalization error of $O\big(\frac{1}{Nκ^{T}}\big)$, where $N$ is the training sample size, $T$ denotes the iteration number, and $κ>0$ denotes the minimum difference between singular values of the gradient estimate. To enhance the generalization of Muon, we propose an effective mixed Muon (MiMuon) optimizer that uses orthogonalization of the gradient cautiously and is a hybrid of the Muon and momentum-based SGD optimizers. We then prove that our MiMuon optimizer has a generalization error of $O\big(\frac{1}{N}\big)$, which is lower than the $O\big(\frac{1}{Nκ^{T}}\big)$ of the Muon optimizer, since $κ$ is generally very small. We also study the convergence properties of our MiMuon algorithm, and prove that it has the same convergence rate of $O(\frac{1}{T^{1/4}})$ as the Muon algorithm. Numerical experiments on training large models, including Qwen3-0.6B and YOLO26m, demonstrate the efficiency of the MiMuon optimizer.
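As a rough illustration of the two quantities this abstract refers to, the sketch below orthogonalizes a gradient matrix and computes the minimum gap between its singular values. It rests on assumptions not taken from the paper: an exact SVD stands in for whatever orthogonalization routine the optimizer actually uses, the gradient is a random matrix, and the helper names `orthogonalize` and `min_singular_gap` are hypothetical. The MiMuon mixing rule itself is not reproduced here.

```python
import numpy as np


def orthogonalize(grad: np.ndarray) -> np.ndarray:
    """Return U @ V^T from the thin SVD grad = U diag(s) V^T.

    Assumption: exact SVD is used here only for illustration.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def min_singular_gap(grad: np.ndarray) -> float:
    """Smallest difference between consecutive singular values of grad."""
    s = np.linalg.svd(grad, compute_uv=False)  # sorted in descending order
    return float(np.min(np.abs(np.diff(s))))


rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3))  # stand-in for a gradient estimate

update = orthogonalize(grad)
# Every singular value of the orthogonalized update equals 1.
update_singular_values = np.linalg.svd(update, compute_uv=False)
gap = min_singular_gap(grad)
```

When the gap is small, a bound that scales like $1/(N κ^{T})$ grows quickly with the number of iterations, which is the motivation the abstract gives for mixing in plain momentum-based SGD steps.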
On Kernel Eigen-alignments of KRR: Reconstruction and Generalization
Liu, Yang, Fokoue, Ernest, Lange, Richard, Krutz, Daniel
This paper investigates the critical role of eigen-alignments between the kernel matrix and the learning targets in achieving robust generalization in learning problems. We establish a direct connection between generalization performance in kernel methods and the estimation of eigenvectors and eigenvalues of matrices, offering a more intuitive understanding compared to prior work with minimal assumptions. Since the prediction in KRR is essentially a weighted sum of eigenvectors/singular vectors, we analyze how much error perturbations to the kernel matrix can cause, and from this derive a bound on the generalization error using the estimation stability of matrix eigenvalues and eigenvectors. Compared with previous work, our analysis concentrates on finite-sample settings and on the generalization error arising from having a suboptimal finite training set. Our findings reveal that in kernel methods, as long as the kernel is of high rank, near-zero reconstruction error can be trivially obtained, implying that the reconstruction error has limited predictive power for generalization. Finally, we establish a generalization bound from an eigenvalue/eigenvector estimation perspective, showing that strong generalization requires increasing eigenvector alignment, eigenvalue magnitude, or gaps between consecutive eigenvalues.
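The statement that a KRR prediction is a weighted sum of kernel eigenvectors can be checked numerically on the training points. The sketch below is a minimal illustration under assumptions that are not from the paper: a Gaussian (RBF) kernel, synthetic data, a ridge parameter of `1e-2`, and the hypothetical helper `rbf_kernel`. It compares the usual closed form $K(K+\lambda I)^{-1}y$ with the spectral form, in which each eigenvector is weighted by its alignment with the targets and a shrinkage factor $\lambda_i/(\lambda_i+\lambda)$.

```python
import numpy as np


def rbf_kernel(x: np.ndarray, z: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Gaussian kernel matrix exp(-gamma * ||x_i - z_j||^2)."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(z**2, axis=1)[None, :]
        - 2.0 * x @ z.T
    )
    return np.exp(-gamma * sq_dists)


rng = np.random.default_rng(0)
n = 30
x = rng.standard_normal((n, 2))                      # synthetic inputs
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(n)   # synthetic noisy targets
ridge = 1e-2                                         # assumed ridge parameter

K = rbf_kernel(x, x)

# Closed-form KRR fitted values on the training set.
fitted_direct = K @ np.linalg.solve(K + ridge * np.eye(n), y)

# Spectral form: K = U diag(eigvals) U^T.
eigvals, eigvecs = np.linalg.eigh(K)
alignment = eigvecs.T @ y                  # projection of targets on eigenvectors
shrinkage = eigvals / (eigvals + ridge)    # per-eigenvector weight
fitted_spectral = eigvecs @ (shrinkage * alignment)
```

In this form, perturbing `K` changes `eigvals` and `eigvecs`, and hence the weights and directions of the sum, which is the route the abstract describes for turning eigenvalue and eigenvector stability into a generalization bound.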
Large Dimensional Kernel Ridge Regression: Extending to Product Kernels
Zhou, Yang, Li, Yicheng, Cheng, Yuqian, Lin, Qian
Recent studies have reported $\textit{saturation effects}$ and $\textit{multiple descent behavior}$ in large dimensional kernel ridge regression (KRR). However, these findings are predominantly derived under restrictive settings, such as inner product kernels on the sphere or strong eigenfunction assumptions like hypercontractivity. Whether such behaviors hold for other kernels remains an open question. In this paper, we establish a broad, new family of large dimensional kernels and derive the corresponding convergence rates of the generalization error. As a result, we recover key phenomena previously associated with inner product kernels on the sphere, including: $i)$ the $\textit{minimax optimality}$ when the source condition $s\le 1$; $ii)$ the $\textit{saturation effect}$ when $s>1$; $iii)$ a $\textit{periodic plateau phenomenon}$ in the convergence rate and a $\textit{multiple-descent behavior}$ with respect to the sample size $n$.
Characterizing the Generalization Error of Random Feature Regression with Arbitrary Data-Augmentation
Morisset, Lucas, Durmus, Alain, Hardy, Adrien
Data augmentation (DA) is now a standard ingredient in modern machine learning pipelines, with extensive empirical evidence reporting improvements in generalization across modalities and tasks Mumuni and Mumuni (2022); Wang et al. (2025). It is often used to encode task-relevant symmetries directly into the training procedure, for instance by encouraging invariance to image rotations or other transformations of the input Shorten and Khoshgoftaar (2019); Chen et al. (2020). It has also been identified as one of the most effective regularization techniques across both supervised learning settings Bishop (1995); Cubuk et al. (2019); Mumuni and Mumuni (2022); Wang et al. (2025) and self-supervised/unsupervised learning Feng et al. (2021); Van Assel et al. (2025). Domain-specific augmentation pipelines have been central to progress in computer vision Shorten and Khoshgoftaar (2019); Kumar et al. (2024), natural language processing Feng et al. (2021); Shorten et al. (2021); Bayer et al. (2022), and time-series or audio applications Wen et al. (2020); Iwana and Uchida (2021); Iglesias et al. (2023). Despite these empirical successes, the benefits of DA remain highly task- and data-dependent, and augmentation schemes are often engineered in an ad hoc manner Fawzi et al. (2016); Cubuk et al. (2019); Lim et al. (2019); Hataya et al. (2020). In contrast with this rich empirical literature, comprehensive theoretical analyses of DA remain relatively scarce. Two classical starting points are, first, the interpretation of additive Gaussian noise as a form of explicit (ridge-like) regularization Bishop (1995); Lin et al. (2024), and second, the idea that leveraging distributional invariances and group structure in the learning objective helps decrease the variance of the model without increasing its bias Chen et al. (2020). Yet, when applied to modern and complex augmentation schemes, these works either provide only upper bounds on the generalization error Lin et al. (2024), or require very strong assumptions on the data distribution (e.g.
Optimal Confidence Band for Kernel Gradient Flow Estimator
Cheng, Yuqian, Chen, Zhuo, Lin, Qian
In this paper, we investigate the supremum-norm generalization error and the uniform inference for a specific class of kernel regression methods, namely the kernel gradient flows. Under the widely adopted capacity-source condition framework in the kernel regression literature, we first establish convergence rates for the supremum-norm generalization error of both continuous and discrete kernel gradient flows under the source condition $s>α_0$, where $α_0\in(0,1)$ denotes the embedding index of the kernel function. Moreover, we show that these rates match the minimax optimal rates. Building on this result, we then construct simultaneous confidence bands for both continuous and discrete kernel gradient flows. Notably, the widths of the proposed confidence bands are also optimal, in the sense that their shrinkage rates are greater than, but can be arbitrarily close to, the minimax optimal rates.
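For readers unfamiliar with the estimator, the sketch below shows one common way to write continuous and discrete kernel gradient flows on the training points. It is an assumption-laden illustration, not the paper's construction: the normalization by $1/n$, the zero initialization, the Gaussian kernel, the synthetic data, and the helper names are all choices made here. The continuous flow solves $\dot f_t = -\tfrac{1}{n}K(f_t - y)$, giving $f_t = (I - e^{-tK/n})\,y$, and the discrete flow is plain gradient descent with a small step size.

```python
import numpy as np


def rbf_kernel(x: np.ndarray, z: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Gaussian kernel matrix exp(-gamma * ||x_i - z_j||^2)."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(z**2, axis=1)[None, :]
        - 2.0 * x @ z.T
    )
    return np.exp(-gamma * sq_dists)


def flow_continuous(K: np.ndarray, y: np.ndarray, t: float) -> np.ndarray:
    """Fitted values f_t = (I - exp(-t K / n)) y, via eigendecomposition."""
    n = len(y)
    eigvals, eigvecs = np.linalg.eigh(K / n)
    spectral_filter = 1.0 - np.exp(-t * eigvals)
    return eigvecs @ (spectral_filter * (eigvecs.T @ y))


def flow_discrete(K: np.ndarray, y: np.ndarray, step: float, num_steps: int) -> np.ndarray:
    """Gradient descent f <- f - step * (K / n) (f - y), started from zero."""
    n = len(y)
    f = np.zeros_like(y)
    for _ in range(num_steps):
        f = f - step * (K / n) @ (f - y)
    return f


rng = np.random.default_rng(0)
n = 30
x = rng.standard_normal((n, 2))                      # synthetic inputs
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(n)   # synthetic noisy targets
K = rbf_kernel(x, x)

fit_cont = flow_continuous(K, y, t=10.0)
fit_disc = flow_discrete(K, y, step=0.01, num_steps=1000)  # total time 10.0
max_gap = float(np.max(np.abs(fit_cont - fit_disc)))       # sup-norm gap on training points
```

The stopping time `t` (or the number of steps) plays the role of the regularization parameter. The confidence bands discussed in the abstract are not constructed in this sketch.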
A Hierarchical Sampling Framework for bounding the Generalization Error of Federated Learning
Filatrella, Dario, Thobaben, Ragnar, Skoglund, Mikael
We study expected generalization bounds for the Hierarchical Federated Learning (HFL) setup using Wasserstein distance. We introduce a generalized framework in which data is sampled hierarchically, and we model it with a multi-layered tree structure that induces dependencies among the clients' datasets. We derive generalization bounds in terms of Wasserstein distance under the Lipschitz assumption on the loss function, by applying a supersample construction that allows us to measure the sensitivity of the algorithm to the change of a single node in the sampling tree. By leveraging the FL structure, we recover and strictly imply existing state-of-the-art conditional mutual information (CMI) bounds in the case of bounded losses. We also show that our bound can be applied together with Differential Privacy assumptions, to recover generalization bounds based on algorithmic privacy. To assess the tightness of our bounds, we study the Gaussian Location Model (GLM) and show that we recover the actual asymptotic rate of the generalization error.